Non-record: ASQU activation, Mixture of Convolutions, BankedLinear #679

Open
andrewmouldon wants to merge 7 commits into openai:main from andrewmouldon:main

Conversation


@andrewmouldon commented on Mar 25, 2026

Explores architectural changes aimed at improving capacity per parameter, even if slower than typical leaderboard approaches:

  • ASQU (Asymmetric Squared Unit): learned per-channel generalization of ReLU^2
  • MoC (Mixture of Convolutions): token-conditioned mixture of short convolutions (dynamic per-token kernels)
  • BankedLinear: shared weight bank across layers (small learned + large fixed random set) with learned per-layer mixing

Ablations use the base train_gpt.py script for a fixed 10k steps; the MLP expansion factor is adjusted to keep total model size matched.

  • ASQU replaces ReLU^2
  • Short conv / MoC is applied to QKV
  • BankedLinear replaces QKV projections

Results

| Model Variant | Pre-quant BPB | Post-quant BPB | Size (bytes) | MLP Mult |
|---|---|---|---|---|
| Baseline | 1.2262 | 1.2328 | 15861272 | 2.00 |
| + ASQU | 1.2232 | 1.2301 | 15898146 | 2.00 |
| + Short Conv (k=1) | 1.2157 | 1.2217 | 15973462 | 1.99 |
| + MoC (k=8) | 1.2121 | 1.2182 | 15911167 | 1.93 |
| + BankedLinear | 1.2098 | 1.2164 | 15852659 | 2.60 |

Results are currently single-seed (1337); additional runs in progress.

ASQU: the per-channel parameterization gives a ~0.001 BPB improvement over a single scalar; the scalar form converges to behavior similar to a leaky ReLU^2 with slope 0.5.
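The PR does not spell out the exact ASQU formula, so the following is a minimal NumPy sketch of one plausible parameterization consistent with the description: ReLU^2 on the positive side plus a learned per-channel coefficient `alpha` on the squared negative side. At `alpha = 0` it reduces to plain ReLU^2, and a single scalar `alpha` gives the leaky-ReLU^2-like behavior mentioned above. The function name and signature are hypothetical.

```python
import numpy as np

def asqu(x, alpha):
    """ASQU sketch (assumed form): asymmetric squared unit.
    x:     (..., C) activations
    alpha: per-channel coefficient, shape (C,) (or a scalar)
    Returns relu(x)^2 + alpha * relu(-x)^2, so alpha controls how much
    of the squared negative side leaks through, per channel."""
    pos = np.maximum(x, 0.0) ** 2
    neg = np.minimum(x, 0.0) ** 2
    return pos + alpha * neg  # alpha broadcasts over the channel dim
```

With `alpha` initialized at zero this starts out exactly as ReLU^2, which makes it a drop-in replacement in the MLP.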

Learning the exponent instead of fixing the square was also explored. While it did not consistently improve final performance (and was more expensive), it revealed a consistent depth-dependent pattern in the learned exponents:

  • early layers ~1.4
  • middle layers ~1.8
  • late layers ~2.2
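The learned-exponent variant can be sketched the same way (again an assumed form, reusing the hypothetical ASQU parameterization above with the square replaced by a trainable power `p`):

```python
import numpy as np

def asqu_learned_exponent(x, alpha, p):
    """Variant sketch (assumed form): learn the exponent p instead of
    fixing the square. Per the reported pattern, p would settle near
    ~1.4 in early layers and ~2.2 in late layers.
    abs() keeps the negative branch real-valued for non-integer p."""
    pos = np.maximum(x, 0.0) ** p
    neg = np.abs(np.minimum(x, 0.0)) ** p
    return pos + alpha * neg
```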

With MoC, generating the dynamic kernel directly via a learned projection performed poorly, suggesting a more constrained mechanism (e.g. interpolation over a basis of kernels) is necessary for stable optimization.
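A minimal NumPy sketch of the constrained basis-interpolation mechanism, assuming a per-token softmax over a bank of K short causal kernels shared across channels (the function names, the gating projection `w_gate`, and the exact gating form are assumptions, not the PR's code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moc(x, kernel_bank, w_gate):
    """Mixture-of-Convolutions sketch (assumed mechanism):
    each token computes softmax weights over a bank of K short causal
    kernels, then is convolved with its own interpolated kernel.
    x:           (T, C) token activations
    kernel_bank: (K, W) bank of width-W kernels, shared across channels
    w_gate:      (C, K) projection producing per-token mixing logits"""
    T, C = x.shape
    K, W = kernel_bank.shape
    gates = softmax(x @ w_gate, axis=-1)   # (T, K) per-token mixture weights
    kernels = gates @ kernel_bank          # (T, W) dynamic kernel per token
    x_pad = np.concatenate([np.zeros((W - 1, C)), x], axis=0)  # causal padding
    out = np.zeros_like(x)
    for t in range(T):
        window = x_pad[t : t + W]          # (W, C): positions t-W+1 .. t
        out[t] = kernels[t] @ window       # mix over the W time taps
    return out
```

Because each token only interpolates within the bank, the dynamic kernel stays inside a fixed low-dimensional subspace, which is the constraint hypothesized to stabilize optimization relative to projecting the kernel weights directly.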

For BankedLinear, one scalar weight per bank entry per layer is used to construct the mixture. Experiments with per-head weights worsened performance.
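A NumPy sketch of the structure described above, assuming each layer's effective weight is a scalar-weighted sum over a shared bank of matrices (a few learned entries plus a larger fixed random set); the class name matches the PR, but the field names, sizes, and initialization scales are illustrative assumptions:

```python
import numpy as np

class BankedLinear:
    """BankedLinear sketch (assumed structure): a bank of weight
    matrices shared across all layers, split into a small learned set
    and a larger fixed random set. Each layer owns one scalar mixing
    coefficient per bank entry; its effective weight is the
    coefficient-weighted sum of the whole bank."""

    def __init__(self, d_in, d_out, n_learned=2, n_fixed=6, n_layers=4, seed=0):
        rng = np.random.default_rng(seed)
        self.learned = rng.standard_normal((n_learned, d_out, d_in)) * 0.02  # trainable
        self.fixed = rng.standard_normal((n_fixed, d_out, d_in)) * 0.02      # frozen
        # one scalar per (layer, bank entry) -- the only per-layer parameters
        self.mix = rng.standard_normal((n_layers, n_learned + n_fixed)) * 0.1

    def weight(self, layer):
        bank = np.concatenate([self.learned, self.fixed], axis=0)  # (B, d_out, d_in)
        return np.einsum('b,bij->ij', self.mix[layer], bank)       # (d_out, d_in)

    def forward(self, x, layer):
        return x @ self.weight(layer).T
```

The parameter savings come from the per-layer cost being just B scalars instead of a full d_out × d_in matrix, at the price of every layer's projection living in the span of the shared bank.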

A layer-specific LoRA on top of the shared bank was also tried, but this worsened performance compared to simply investing the capacity in the MLP.
